Dyr og Data

Statistical thinking — descriptive statistics

Gavin Simpson

Aarhus University

Mona Larsen

Aarhus University

2024-09-19

Descriptive statistics

Given a data set, how do we summarise the data?

What is the length of flippers in male and female Adelie penguins?

  • Males
    • 195, 201, 197, 186, 190, 195, 189, 196, 190, 195, 190, 198, 185, 189, 188
  • Females
    • 193, 187, 186, 196, 202, 178, 176, 199, 185, 190, 193, 190, 189, 195, 174

Describe properties of the data using summary or descriptive statistics

Descriptive statistics

Two main types of descriptive statistic: measures of location and spread

Measures of location or central tendency describe where the majority of the data are found, e.g.

  • mean (average)
  • median
  • mode

How variable the data are about this location is described by measures of spread

  • standard deviation (\(\sigma\))
  • variance (\(\sigma^2\))
  • inter-quartile range (IQR)

Mean or average

The mean, or average, is the sum of the observations, divided by the number of observations

\[\overline{y} = \frac{1}{n}\sum\limits^n_{i=1}y_i\]

Flipper lengths for male penguins

195 + 201 + 197 + 186 + 190 + 195 + 189 + 196 + 190 + 195 + 190 + 198 + 185 + 189 + 188 = 2884

2884 / 15 = 192.27
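The same calculation as a short Python sketch (Python is used here purely for illustration; any statistics software will do):

```python
# Flipper lengths (mm) for the 15 male Adelie penguins above
males = [195, 201, 197, 186, 190, 195, 189, 196, 190, 195,
         190, 198, 185, 189, 188]

total = sum(males)            # sum of the observations: 2884
mean = total / len(males)     # divide by the number of observations, n = 15
print(total, round(mean, 2))  # 2884 192.27
```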

Mean or average

The mean is not a robust measure of central tendency. It is heavily influenced by extreme observations

195, 201, 197, 186, 190, 195, 189, 196, 190, 195, 190, 198, 185, 189, 526.4

Replaced a value of 188 with a score of 526.4

Mean of modified data set is 214.83, larger than all observations except the modified value
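The effect of the single extreme value can be reproduced directly:

```python
males = [195, 201, 197, 186, 190, 195, 189, 196, 190, 195,
         190, 198, 185, 189, 188]

# Replace the final value (188) with the extreme score 526.4
modified = males[:-1] + [526.4]
mean_modified = sum(modified) / len(modified)
print(round(mean_modified, 2))  # 214.83, larger than every unmodified value
```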

Median

A robust measure of central tendency should be less affected by extreme observations. The median is one such measure

The median is the value of a set of observations that has equal numbers of observations above and below it

If \(n\) is odd, median is the middle value of the ordered observations

If \(n\) is even, median is the midpoint between the \(n / 2\)th and \((n / 2) + 1\)th observations

  • Female flipper lengths: 193, 187, 186, 196, 202, 178, 176, 199, 185, 190, 193, 190, 189, 195, 174
  • Ordered lengths: 174, 176, 178, 185, 186, 187, 189, 190, 190, 193, 193, 195, 196, 199, 202
  • Median is 190

Median

If \(n\) is even, median is the midpoint between the \(n / 2\)th and \((n / 2) + 1\)th observations

  • First ten female flipper lengths, ordered: 176, 178, 185, 186, 187, 190, 193, 196, 199, 202

Middle two observations are 187, 190

Median is (187 + 190) / 2 = 188.5
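Both cases can be checked with Python's standard-library `statistics` module (used here only as an illustration):

```python
import statistics

females = [193, 187, 186, 196, 202, 178, 176, 199, 185, 190,
           193, 190, 189, 195, 174]

# Odd n (15): the middle value of the ordered observations
print(statistics.median(females))    # 190

# Even n (10): midpoint of the two middle ordered observations
first_ten = sorted(females[:10])     # 176, 178, ..., 199, 202
print(statistics.median(first_ten))  # (187 + 190) / 2 = 188.5
```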

Measures of spread

Measures of spread describe the dispersion, or variability, of our observations

  • Variance — \(\sigma^2\)
  • Standard deviation — \(\sigma\)
  • Range
  • Inter-quartile range — IQR
  • Mean absolute deviation — MAD
  • Coefficient of variation — CV
  • Standard error of the mean — \(\sigma_{\overline{y}}\)

Measures of spread

Range

The range is a simple measure of spread and is the difference between the minimum and maximum observed values

For the male Adelie penguins

185, 186, 188, 189, 189, 190, 190, 190, 195, 195, 195, 196, 197, 198, 201

Range is \(201 - 185 = 16\)

For the female Adelie penguins

174, 176, 178, 185, 186, 187, 189, 190, 190, 193, 193, 195, 196, 199, 202

Range is \(202 - 174 = 28\)
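In Python (illustrative only), the range is simply the difference of the maximum and minimum:

```python
males = [195, 201, 197, 186, 190, 195, 189, 196, 190, 195,
         190, 198, 185, 189, 188]
females = [193, 187, 186, 196, 202, 178, 176, 199, 185, 190,
           193, 190, 189, 195, 174]

range_m = max(males) - min(males)      # 201 - 185 = 16
range_f = max(females) - min(females)  # 202 - 174 = 28
print(range_m, range_f)
```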

Sample quantiles

Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a random variable

Sample quantiles

Percentiles are a type of quantile — 90th percentile score means 90% of sampled scores are lower than this percentile

The median is a quantile — it is the 50th percentile

Quartiles break ordered sample into 4 regions

  • Lower quartile is the 25th percentile
  • Upper quartile is the 75th percentile

Deciles break the ordered sample into 10 regions

Values of the sample quantiles don’t depend on the mean or median of the sample

Interquartile range (IQR)

Range doesn’t tell us much about how variable observations are between the extremes & is heavily influenced by extreme values

Better measure would trim off some proportion of the smallest & largest observations before computing the range

The IQR is the difference between the upper (75th percentile) and lower (25th percentile) quartiles of the data

185, 186, 188, 189, 189, 190, 190, 190, 195, 195, 195, 196, 197, 198, 201

Lower & upper quartiles are: 189, 195.5

IQR is: 6.5
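Quartiles can be computed with `statistics.quantiles` in Python (illustrative; note that quantile conventions differ between software packages, and the "inclusive" method reproduces the values used here):

```python
import statistics

males = [195, 201, 197, 186, 190, 195, 189, 196, 190, 195,
         190, 198, 185, 189, 188]

# n=4 splits the sample into quartiles; "inclusive" matches the
# convention used in this worked example (other methods differ slightly)
q1, q2, q3 = statistics.quantiles(males, n=4, method="inclusive")
print(q1, q3, q3 - q1)  # 189.0 195.5 6.5
```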

Variance & standard deviation

For a random variable \(y\), the variance \(\sigma^2(y)\) is a measure of how far the observations differ from the expected value (the mean)

  • We will rarely have all the possible observations for a population
  • \(\sigma^2\) for the population is unknown, but can estimate it from our sample of data
    • This is the sample variance, often denoted \(s^2\) or \(\hat{\sigma}^2\)

How should we measure deviations from the mean? The sum of squares is a logical place to start

\[ \text{SS}_y = \sum_{i = 1}^n (y_i - \bar{y})^2 \]

Variance & standard deviation

The mean of the sum of squares is a value known as the mean square — uses average deviation from sample mean as estimate of population variance

\[ \hat{\sigma}_y^2 = \frac{1}{n}\sum_{i = 1}^n (y_i - \bar{y})^2 = \frac{\displaystyle \sum_{i = 1}^n (y_i - \bar{y})^2}{n} \]

Problem: mean square is a biased estimator of \(\sigma^2\)

  • For a single observation, \(y_1\), the observed mean is just \(y_1\)
  • The mean square for this dataset would be 0, which doesn’t seem right
  • Biased low

Leads to the concept of degrees of freedom — the number of independent pieces of information that we can use to estimate statistical parameters

Variance & standard deviation

For an unbiased estimate of \(\sigma^2\), divide sums of squares by \(n-1\) not \(n\)

Why \(n - 1\)? Usual answer: We have used 1 parameter to estimate the mean; we have \(n - 1\) independent pieces of information left

Sample variance is defined as

\[ \hat{\sigma}_y^2 = \frac{1}{n - 1} \sum_{i=1}^n (y_i - \bar{y})^2 \]

Variance is measured in squared units relative to \(y\)
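The bias of dividing by \(n\) rather than \(n - 1\) can be checked by simulation. This Python sketch (illustrative; sample size and repetition count are arbitrary choices) draws repeated samples from a population with known variance \(\sigma^2 = 1\):

```python
import random

random.seed(1)
n, reps = 5, 20000
biased_sum = unbiased_sum = 0.0

for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]  # population variance is 1
    ybar = sum(sample) / n
    ss = sum((y - ybar) ** 2 for y in sample)
    biased_sum += ss / n          # mean square: divides by n
    unbiased_sum += ss / (n - 1)  # sample variance: divides by n - 1

# On average the mean square underestimates sigma^2, while the
# n - 1 version averages close to the true value of 1
print(biased_sum / reps, unbiased_sum / reps)
```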

Variance & standard deviation

Taking the square root of \(\hat{\sigma}^2\) gives the standard deviation, which is measured on the same scale as \(y\)

\[ \hat{\sigma} = \sqrt{\hat{\sigma}^2} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^n (y_i - \bar{y})^2} \]

Now \(\hat{\sigma}\) is measured in the same units as \(y\) — easier to understand

Note we need at least two observations to compute \(\hat{\sigma}^2\) and \(\hat{\sigma}\)

Variance & standard deviation

Mean male flipper length: 192.27

Deviations from mean:

2.73, 8.73, 4.73, -6.27, -2.27, 2.73, -3.27, 3.73, -2.27, 2.73, -2.27, 5.73, -7.27, -3.27, -4.27

Squared deviations from mean:

7.47, 76.27, 22.4, 39.27, 5.14, 7.47, 10.67, 13.94, 5.14, 7.47, 5.14, 32.87, 52.8, 10.67, 18.2

Sum of squares: 314.93

Variance: \(\hat{\sigma}^2 = \frac{314.93}{15 - 1}\) = 22.5

Standard deviation: \(\hat{\sigma} = \sqrt{\frac{314.93}{15 - 1}}\) = 4.74
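The whole worked example, step by step, in Python (illustrative only):

```python
males = [195, 201, 197, 186, 190, 195, 189, 196, 190, 195,
         190, 198, 185, 189, 188]
n = len(males)
ybar = sum(males) / n                         # 192.27

ss = sum((y - ybar) ** 2 for y in males)      # sum of squares: 314.93
variance = ss / (n - 1)                       # divide by n - 1, not n
sd = variance ** 0.5                          # back on the scale of y
print(round(ss, 2), round(variance, 2), round(sd, 2))  # 314.93 22.5 4.74
```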

Mean absolute deviation (MAD)

A robust alternative to \(\sigma^2\) and \(\sigma\) is the mean absolute deviation (MAD)

Rather than using squared differences, MAD uses absolute deviations from the sample mean

\[\text{MAD} = \frac{1}{n} \sum_{i=1}^n \left | y_i - \bar{y} \right |\]

Mean absolute deviation (MAD)

Mean male flipper length: 192.27

Deviations from mean:

2.73, 8.73, 4.73, -6.27, -2.27, 2.73, -3.27, 3.73, -2.27, 2.73, -2.27, 5.73, -7.27, -3.27, -4.27

Absolute deviations from mean:

2.73, 8.73, 4.73, 6.27, 2.27, 2.73, 3.27, 3.73, 2.27, 2.73, 2.27, 5.73, 7.27, 3.27, 4.27

Sum of absolute deviations: 62.27

MAD: \(\text{MAD} = \frac{62.27}{15}\) = 4.15
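The MAD calculation in Python (illustrative; this is the mean absolute deviation defined above, dividing by \(n\)):

```python
males = [195, 201, 197, 186, 190, 195, 189, 196, 190, 195,
         190, 198, 185, 189, 188]
ybar = sum(males) / len(males)

# Mean of the absolute deviations from the sample mean
mad = sum(abs(y - ybar) for y in males) / len(males)
print(round(mad, 2))  # 4.15
```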

Coefficient of variation

It is difficult to compare the standard deviations of two samples measured in different units, or with very different means, because \(\sigma\) scales with the magnitude of the data

The coefficient of variation is the standard deviation divided by the sample mean. Often multiplied by 100 to express result as a percentage

\[ \text{CV} = \left ( \frac{\hat{\sigma}}{\bar{y}} \right ) \times 100 = \left (\frac{\displaystyle \sqrt{\frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2}}{\displaystyle \frac{1}{n}\sum\limits^n_{i=1}y_i} \right ) \]
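For the male flipper lengths, the CV works out as follows in Python (illustrative only):

```python
males = [195, 201, 197, 186, 190, 195, 189, 196, 190, 195,
         190, 198, 185, 189, 188]
n = len(males)
ybar = sum(males) / n

# Sample standard deviation (divide by n - 1), then scale by the mean
sd = (sum((y - ybar) ** 2 for y in males) / (n - 1)) ** 0.5
cv = sd / ybar * 100                # expressed as a percentage
print(round(cv, 2))                 # 2.47
```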

Skewness & kurtosis

Two other descriptive statistics you might encounter are skewness & kurtosis

They describe how the shape of the sample distribution departs from that of a Gaussian distribution

Skewness

Skewness describes how far the sample differs from a symmetrical distribution

  • Right skew is where tail extends to right, positive skewness
  • Left skew is where tail extends to left, negative skewness

Kurtosis

Kurtosis describes how probability density is distributed in the tails of a distribution

  • Platykurtosis: more mass in the centre of the distribution, negative kurtosis
  • Leptokurtosis: more mass in the tails of the distribution, positive kurtosis
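One common way to compute these is from central moments; this Python sketch uses the simple moment-based definitions (some software applies small-sample corrections, so values may differ slightly between packages):

```python
def central_moment(y, k):
    """k-th central moment, dividing by n."""
    ybar = sum(y) / len(y)
    return sum((v - ybar) ** k for v in y) / len(y)

def skewness(y):
    # Third central moment, standardised by the spread
    return central_moment(y, 3) / central_moment(y, 2) ** 1.5

def excess_kurtosis(y):
    # Fourth central moment, standardised; minus 3 so a Gaussian gives 0
    return central_moment(y, 4) / central_moment(y, 2) ** 2 - 3

right_skewed = [1, 2, 3, 4, 100]  # long right tail: positive skewness
symmetric = [1, 2, 3, 4, 5]       # flat, light tails: negative excess kurtosis
print(skewness(right_skewed) > 0)       # True
print(excess_kurtosis(symmetric) < 0)   # True
```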

Skewness & kurtosis

Rarely mentioned in the literature but some software will churn these out alongside other descriptive statistics

Their statistical properties are not very good; they are sensitive to outliers and to differences in the means of the distributions being compared

A plot will suffice

Standard deviation & standard error

The standard deviation & the standard error of the mean are often confused

The standard deviation (\(\hat{\sigma}\)) is a measure of the deviation of observations about the mean

The standard error (of the mean) is a measure of how variable or uncertain the estimate (\(\bar{y}\)) of the population mean (\(\mu\)) is

\[\hat{\sigma}_{\overline{y}} = \frac{\hat{\sigma}}{\sqrt{n}}\]

A large standard error, relative to the size of the mean, would indicate lots of variability in the means we would observe if we took a large number of samples (of the same size) from the population
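For the male flipper data, the standard error of the mean follows directly from the standard deviation (Python used for illustration only):

```python
males = [195, 201, 197, 186, 190, 195, 189, 196, 190, 195,
         190, 198, 185, 189, 188]
n = len(males)
ybar = sum(males) / n

# Sample standard deviation, then divide by sqrt(n) for the SEM
sd = (sum((y - ybar) ** 2 for y in males) / (n - 1)) ** 0.5
sem = sd / n ** 0.5
print(round(sem, 2))  # 1.22
```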
